Team Notes

****Delete this notes section eventually****

Outstanding conceptual questions

  • How do we decide which classification model(s) to try? Just try as many as we have time for and then compare misclassificaiton rates?
  • How does the video about the importance of splitting the data before starting relate to K-fold CV, which we learned about next as the better alternative?
  • How do we deal with the 59 variables, given that it’s bad data mining practice to drop anything?
  • What is a cox validation study?
  • How did ProPublica decide on a logistic regression using low/high as the outcome? glm(formula = score_factor ~ gender_factor + age_factor + race_factor + priors_count + crime_factor + two_year_recid, family = "binomial", data = df)
  • Why are the datasets of violent crimes and regular crimes separate? Why are there different sample sizes? Why can’t we just merge them?

To do list

  • High level of effort
    • Validation: Figure out how we should be splitting our data into training/validation from the beginning so we don’t make this mistake. Then document it at the start of section 2.
    • Improve KR’s descriptions of possible models in 2a.1: Model Selection and validation. You can use the book’s section called “4.5 A Comparison of Classification Methods” as a resource. Then decide which models to use.
  • Medium level of effort
  • Low level of effort
    • Switch to relative path referencing while importing CSVs

Tasks for later

  • Medium level of effort
    • Model Selection and validation (2a.1): run our first model!
  • High level of effort
    • Figure out the best version of a given model (ex. logistic regression) using test/training error and figuring out if we can incorporate sensitivity/specifity in choosing the version of the model.
    • Compare models (ex. logistic regression vs. random forrest)
    • Make pretty pictures about how our models performed (lift charts, ROC curves, sensitivity vs. specificity)

List of resources

Our github repo
ProPublica main article
ProPublica methodology
ProPublica github
Raw version of COMPAS survey
Columbia article on Risk as proxy for race (2010) Article on Risk, Race, and Recidivism
Original publication by COMPAS creator
Assignment instructions on Canvas
Project description on Canvas

****End of team notes****

#Suppressing warnings about libraries
#Import  libraries (currently overdoing it)
library(tidyverse)    
library(ggplot2)      # graphics library
library(gridExtra)    # For displaying graphs side-by-side
library(knitr)        # contains knitting control
library(tree)         # For the tree-fitting 'tree' function
library(randomForest) # For random forests
library(rpart)        # For nicer tree fitting
library(partykit)     # For nicer tree plotting
library(boot)         # For cv.glm
library(leaps)        # needed for regsubsets (though maybe not relevant b/c our outcome vars are binary)
library(plotly)
library(rsample)      # data splitting, just trying to see if works (for naive bayes)
library(dplyr)        # data transformation, just trying to see if works (naive bayes)
library(caret)        # naive bayes package
library(h2o)          # naive bayes package
library(MASS)         # For LDA

#Format numbers so that they are not in scientific notation.
options(scipen = 4)

Introduction

Background

In 2016, ProPublica published a story on how a commonly-used pre-trial risk assessment tool called the COMPAS is racially biased. The journalists showed that in spite of a 2009 validation study showing similar accuracy rates for black and white men (67 percent versus 69 percent), the inacccuracies were in opposite directions for the two groups. This racial bias of the tool is reflected in their their published table replicated below:

Type of error White African American
Labeled Higher Risk, But Didn’t Re-Offend 23.5% 44.9%
Labeled Lower Risk, Yet Did Re-Offend 47.7% 28.0%

Additional information on the COMPAS

The COMPAS is widely used across states [add more here from ProPublica article]. It gives people a risk score ranging from 1 to 10, where risk scores of 1 to 4 are “Low”, 5 to 7 are labeled “Medium”, and 8 to 10 are labeled “High.” Although race is not included in its 137 questions, some of the questions such as how often people moved can be linked to poverty…[add more here from]

Prompt

  • Using the available data, construct a Risk Assessment Instrument (RAI) for predicting two-year recidivism.

    • Evaluate the predictive performance of your model.
    • What are the most important predictors of recidivism?
  • Create an RAI for predicting violent recidivism.

    • Evaluate the predictive performance of your model.
    • What are the most important predictors of violent recidivism?
    • How do they compare to the important predictors of general recidivism?
  • Assess whether the RAIs from (A) and (B) are equally predictive across race/ethnicity groups? How about across age and gender_factor groups?

  • Compare your RAIs to the COMPAS RAI.

    • Do your RAIs perform better or worse than COMPAS?
    • Do your RAIs produce similar classifications to COMPAS?
    • Can you identify any systematic differences between your classifications and those of COMPAS?

Goal

Our goal is to investigate whether it is possible to create a Risk Assessment Instrument (RAI) that is more accurate and less racist than the COMPAS.

Data

This analysis is run using ProPublica’s data. In order to have clean data, it is necessary to remove certain observations. Fortunately, ProPublica staff published their Jupyter notebook with data cleaning steps in R, so our data is cleaned in the same way (ex. dropping people whose charge date was not w/in 30 days ). However, although we drop the same observations (rows) as they do, we have adapted their code so that it does not drop any attributes (columns) because this would be bad practice for data mining.

###################################
# Read in all ProPublica datasets #
###################################
compas.scores.raw <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\compas-scores-raw.csv")

compas.scores <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\compas-scores.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [45]
compas.scores.two.years <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\compas-scores-two-years.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
compas.scores.two.years.violent <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\compas-scores-two-years-violent.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49], 'two_year_recid'
## => 'two_year_recid_1' [54]
cox.parsed <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\cox-parsed.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
cox.violent.parsed <- read_csv("C:\\Users\\uukay\\OneDrive\\Documents\\mini4\\Data Mining\\Final project\\RecidivismPrediction\\datasets\\cox-violent-parsed.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
#Next step 1 (low LOE): Somebody fix so the file paths aren't just for my local directories. Here are three options in order of how ideal they would be:
#   The propublica github files
#   My forked version of the propublica github files:  https://github.com/kaylareiman1/compas-analysis
#   Our copied and pasted version of the propublica github files: https://github.com/kaylareiman1/RecidivismPrediction/tree/master/datasets

####################################
# Adapt ProPublica's Data Cleaning #
####################################
#Note: This code is copied directly from ProPublica. Here is why they dropped data:
      # - dropped if charge date not w/in 30 days  
      # - Coded the recidivist flag -- is_recid -- to be -1 if could not find a compas case at all.  
      # - Ordinary traffic offenses removed  
      # - Filtered the underlying data from Broward county to include:  
      #   + people who had either recidivated in two years  
      #   + had at least two years outside of a correctional facility.  

#Unlike ProPublica, we won't be dropping any attributes because that may bias our model.
### All recidivism ###

#Dropping bad data
recid.all <-compas.scores.two.years %>% 
        filter(days_b_screening_arrest <= 30) %>%
        filter(days_b_screening_arrest >= -30) %>%
        filter(is_recid != -1) %>%
        filter(c_charge_degree != "O") %>%
        filter(score_text != 'N/A')

#Recoding variables
recid.all$length_of_stay <- as.numeric(as.Date(recid.all$c_jail_out) - as.Date(recid.all$c_jail_in))
recid.all <- mutate(recid.all, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age_cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race)) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(score_text != "Low", labels = c("LowScore","HighScore")))


### Violent recidivism ###
#Dropping bad data
recid.vio <- compas.scores.two.years.violent %>% 
        filter(days_b_screening_arrest <= 30) %>%
        filter(days_b_screening_arrest >= -30) %>% 
        filter(is_recid != -1) %>%
        filter(c_charge_degree != "O") %>%
        filter(v_score_text != 'N/A')

#Recoding variables
#KR note: Why is race coded differently? Is there a mistake?
recid.vio <- mutate(recid.vio, crime_factor = factor(c_charge_degree)) %>%
      mutate(age_factor = as.factor(age_cat)) %>%
      within(age_factor <- relevel(age_factor, ref = 1)) %>%
      mutate(race_factor = factor(race,
                                  labels = c("African-American", 
                                             "Asian",
                                             "Caucasian", 
                                             "Hispanic", 
                                             "Native American",
                                             "Other"))) %>%
      within(race_factor <- relevel(race_factor, ref = 3)) %>%
      mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
      within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
      mutate(score_factor = factor(v_score_text != "Low", labels = c("LowScore","HighScore")))



####################################
# Split the data?????????????????? #
####################################
#Per Saturday's call, maybe this is where we should randomly split the data. If so, we should use a seed so it's the same every time.

Section 1: Exploratory Data Analysis

1a: Overview of datasets

[KR note: must check whether 1 person per row or different row driver]

This project uses 2 datasets from ProPublica’s github repository, which are described in the table below:

Dataset name original obs obs after cleaning # varibles
recid.all 7214 6172 59
recid.vio 4743 4020 59

Out of the 59 variables in each dataset, we first looked at the variables in the logistic that ProPublica ran to see how well recidivism could predict whether somebody was flagged as low-risk. Their code is pasted below:
glm(formula = score_factor ~ gender_factor + age_factor + race_factor + priors_count + crime_factor + two_year_recid, family = "binomial", data = df)

An overview of variables that ProPublica used is below: [KR note: this dictionary could be better.] - two_year_recid: this is the variable that ProPublica used to indicate recidivism in each dataset.
- score_factor: the binary variable indicating low risk vs. medium/high risk. - Other covariates: -gender_factor: male vs. female -age_factor: a 3-level categorical variable for age -race_factor: from the original race variable. As will be shown below, most categories have very few observations.
-crime_factor: felonies vs. misdimeanors -priors_count: prior infractions?

A list of all of the variables in the dataset is shown below:

#It'll be ideal if we can delete this section eventually (or make it prettier). However, I'm leaving this here so you can get to know the data better without running everything from scratch.

colnames.all <- colnames(recid.all)
colnames.vio <- colnames(recid.vio)

#Print full list of columns
colnames.all
##  [1] "id"                      "name"                   
##  [3] "first"                   "last"                   
##  [5] "compas_screening_date"   "sex"                    
##  [7] "dob"                     "age"                    
##  [9] "age_cat"                 "race"                   
## [11] "juv_fel_count"           "decile_score"           
## [13] "juv_misd_count"          "juv_other_count"        
## [15] "priors_count"            "days_b_screening_arrest"
## [17] "c_jail_in"               "c_jail_out"             
## [19] "c_case_number"           "c_offense_date"         
## [21] "c_arrest_date"           "c_days_from_compas"     
## [23] "c_charge_degree"         "c_charge_desc"          
## [25] "is_recid"                "r_case_number"          
## [27] "r_charge_degree"         "r_days_from_arrest"     
## [29] "r_offense_date"          "r_charge_desc"          
## [31] "r_jail_in"               "r_jail_out"             
## [33] "violent_recid"           "is_violent_recid"       
## [35] "vr_case_number"          "vr_charge_degree"       
## [37] "vr_offense_date"         "vr_charge_desc"         
## [39] "type_of_assessment"      "decile_score_1"         
## [41] "score_text"              "screening_date"         
## [43] "v_type_of_assessment"    "v_decile_score"         
## [45] "v_score_text"            "v_screening_date"       
## [47] "in_custody"              "out_custody"            
## [49] "priors_count_1"          "start"                  
## [51] "end"                     "event"                  
## [53] "two_year_recid"          "length_of_stay"         
## [55] "crime_factor"            "age_factor"             
## [57] "race_factor"             "gender_factor"          
## [59] "score_factor"
#Print columns that differ
colnames.vio[(colnames.vio != colnames.all)]
## [1] "two_year_recid_1"
colnames.all[(colnames.vio != colnames.all)]
## [1] "length_of_stay"

Given the multiple posibilities for recidivism outcome variables, the tables below show the checks that were run to determine that two_year_recid is the correct outcome variable to use in both datasets. [KR note: mabe delete this later. But for now, it would be great if somebody can confirm that you agree with this conclusion or decide whether we should create a new variable for violent recidivism or something].

#Same comment as above: It'll be ideal if we can delete this section eventually (or make it prettier). However, I'm leaving this here so you can get to know the data better without running everything from scratch.

#Check how the recidivism variables relate
kable(recid.all %>%
  group_by (two_year_recid, is_recid, is_violent_recid, violent_recid) %>%
  summarize(count = n()), caption = "How variables relate in recid.all dataset")
How variables relate in recid.all dataset
two_year_recid is_recid is_violent_recid violent_recid count
0 0 0 NA 3182
0 1 0 NA 141
0 1 1 NA 40
1 1 0 NA 2157
1 1 1 NA 652
kable(recid.vio %>%
  group_by (two_year_recid, two_year_recid_1, is_recid, is_violent_recid, violent_recid) %>%
  summarize(count = n()), caption = "How variables relate in recid.vio dataset")
How variables relate in recid.vio dataset
two_year_recid two_year_recid_1 is_recid is_violent_recid violent_recid count
0 0 0 0 NA 3187
0 0 1 0 NA 141
0 0 1 1 NA 40
1 1 1 1 NA 652

1b: Demographic Characteristics

Note: These graphs use the full dataset (not limited to violent crime) The univariate graphs below show that the most categories in the dataset were men, black people, and those between the age of of 20 and 45.

#These three graphs are identical, except with different input variables.
gender_factor.uni.all <- ggplot(recid.all, aes(x=gender_factor))
gender_factor.uni.all + 
  geom_bar(stat = "count", na.rm = F) + 
  ggtitle("Univariate Gender Breakdown") + 
  guides(fill = FALSE) + 
  xlab("Gender") + 
  ylab("Number of People")

race_factor.uni.all <- ggplot(recid.all, aes(x=race_factor))
race_factor.uni.all + 
  geom_bar(stat = "count", na.rm = F) + 
  ggtitle("Univariate Race Breakdown") + 
  guides(fill = FALSE) + 
  xlab("Race") + 
  ylab("Number of People")

age_factor.uni.all <- ggplot(recid.all, aes(x=age_factor))
age_factor.uni.all + 
  geom_bar(stat = "count", na.rm = F) + 
  ggtitle("Univariate Age Breakdown") + 
  guides(fill = FALSE) + 
  xlab("Age") + 
  ylab("Number of People")

1c: Recidivism Rates

The table below shows how common two-year recidivism of both types was in our dataset.

rate.all <- recid.all %>%
  summarize(100*round(mean(two_year_recid), 2))
rate.vio <- recid.vio %>%
  summarize(100*round(mean(two_year_recid), 2))
Dataset Two-year Rate
All recidivism (recid.all) 46%
Violent recidivism (recid.vio) 16%

1d: COMPAS Predictions

The graphs below show participants’ 10-point scores, where a score of 10 indicates the highest recidivism potential. Judges were often shown categories of low, medium, and high risk. The ProPublica analysis combined the categories of medium and high in order to create a binary variable called score_factor with two categories.

#Create histograms of decile scores for all crime, color coded by the category. 
ranking.all <- ggplot(recid.all, aes(x=decile_score, fill = score_text))+
  geom_bar(stat = "count", bin = 1) +
  ggtitle("Prediction Metrics\nAll data") + 
  xlab("Prediction") + 
  guides(fill = FALSE) + 
  ylab("Number of people") 

#Note that v_score_text and v_decile_score are different from decile_score and score_text
ranking.vio <- ggplot(recid.vio, aes(x=v_decile_score, fill = v_score_text))+
  geom_bar(stat = "count", bin = 1) +
  ggtitle("Prediction Metrics\nViolence data") + 
  xlab("Prediction")+ 
  ylab(NULL)

grid.arrange(ranking.all, ranking.vio, ncol = 2)

1e: Accuracy of Predictions by Demographic Group

[KR note to team: I actually expected these wouldn’t look alike at all. But by the time I finished this, I’d spent so much time adapting code from our good ol’ NLSY project that now I want to keep them.] Initial data exploration showed that the trends in recidivism were similar to predictions. However, a few limitations should be noted:
- The higher rate of recidivism among African-American respondents cannot be separated from bias in policing [cite Michelle Alexander].
- This includes no interactions between terms demographic characteristics (ex. race and gender)
- As ProPublica pointed out, these charts do not differentiate between false positives and false negatives.

Additionally, we already know from the univariate race graphs that the recidivism rates for groups other than White, Hispanic, and African-American people will be distorted by a small sample size. [KR note: perhaps we should drop these obs or create a new race var?]

#Note -- just copying and pasting same charts for different characteristics. Could create a function instead. Also, note that I only ran this for recid.all because running it for recid.vio seemed like too many charts. 

### gender_factor ###
# Bivariate w/ decile score 
gender_factor.bi.conf.all <- recid.all %>%
  group_by(gender_factor) %>%
  summarize(Avgdecile_score = mean(decile_score),
            upper = t.test(decile_score)$conf.int[1],
            lower = t.test(decile_score)$conf.int[2])

gender_factor.bi.all.dec <- ggplot(data = gender_factor.bi.conf.all, aes(x = gender_factor, y = Avgdecile_score)) +  geom_bar(stat = "identity") +
  xlab("gender_factor") + 
  ylab("Average prediction") +
  ggtitle("Prediction from COMPAS") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size=12))

# Bivariate w/ recid 
gender_factor.bi.conf.all <- recid.all %>%
  group_by(gender_factor) %>%
  summarize(Avgtwo_year_recid = mean(two_year_recid),
            upper = t.test(two_year_recid)$conf.int[1],
            lower = t.test(two_year_recid)$conf.int[2])

gender_factor.bi.all.recid <- ggplot(data = gender_factor.bi.conf.all, aes(x = gender_factor, y = Avgtwo_year_recid)) +
  geom_bar(stat = "identity") +
  xlab("gender_factor") + 
  ylab("Average Rate") +
  ggtitle("Actual 2-year Recidivism") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size=12))


### RACE ###
# Bivariate w/ decile score 
race_factor.bi.conf.all <- recid.all %>%
  group_by(race_factor) %>%
  summarize(Avgdecile_score = mean(decile_score),
            upper = t.test(decile_score)$conf.int[1],
            lower = t.test(decile_score)$conf.int[2])

race_factor.bi.all.dec <- ggplot(data = race_factor.bi.conf.all, aes(x = race_factor, y = Avgdecile_score)) +  geom_bar(stat = "identity") +
  xlab("Race") + 
  ylab("Average prediction") +
  ggtitle("Prediction from COMPAS") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1)  +
  theme(text = element_text(size=12), axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
# Bivariate w/ recid 
race_factor.bi.conf.all <- recid.all %>%
  group_by(race_factor) %>%
  summarize(Avgtwo_year_recid = mean(two_year_recid),
            upper = t.test(two_year_recid)$conf.int[1],
            lower = t.test(two_year_recid)$conf.int[2])

race_factor.bi.all.recid <- ggplot(data = race_factor.bi.conf.all, aes(x = race_factor, y = Avgtwo_year_recid)) +
  geom_bar(stat = "identity") +
  xlab("Race") + 
  ylab("Average Rate") +
  ggtitle("Actual 2-year Recidivism") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1)  +
  theme(text = element_text(size=12), axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))

### AGE ###
# Bivariate w/ decile score 
age_factor.bi.conf.all <- recid.all %>%
  group_by(age_factor) %>%
  summarize(Avgdecile_score = mean(decile_score),
            upper = t.test(decile_score)$conf.int[1],
            lower = t.test(decile_score)$conf.int[2])

age_factor.bi.all.dec <- ggplot(data = age_factor.bi.conf.all, aes(x = age_factor, y = Avgdecile_score)) +  geom_bar(stat = "identity") +
  xlab("Age") + 
  ylab("Average prediction") +
  ggtitle("Prediction from COMPAS") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size=12))

# Bivariate w/ recid 
age_factor.bi.conf.all <- recid.all %>%
  group_by(age_factor) %>%
  summarize(Avgtwo_year_recid = mean(two_year_recid),
            upper = t.test(two_year_recid)$conf.int[1],
            lower = t.test(two_year_recid)$conf.int[2])

age_factor.bi.all.recid <- ggplot(data = age_factor.bi.conf.all, aes(x = age_factor, y = Avgtwo_year_recid)) +
  geom_bar(stat = "identity") +
  xlab("age") + 
  ylab("Average Rate") +
  ggtitle("Actual 2-year Recidivism") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size=12))

#Display the predictions versus actual recidivism rate averages
grid.arrange(gender_factor.bi.all.dec, gender_factor.bi.all.recid, ncol = 2)

grid.arrange(race_factor.bi.all.dec, race_factor.bi.all.recid, ncol = 2)

grid.arrange(age_factor.bi.all.dec, age_factor.bi.all.recid, ncol = 2)

Section 2: Methodology

Prior to starting the data mining process, we randomly split the data into training and validation sets.

#Somebody figure out how we're supposed to do this. I thought I understood it until we learned about cross-validation, which is better than regular validation. Also, should we do it here or before EDA? 

2a: Data Mining Methods and Performance

2a.1: Model Selection and validation

Possible models
[KR note to team: we may cut some of these if we a) don’t have time or b) think they’re the wrong models]
Given our task of predicting binary outcomes, we considered the following classification methods:
- Logistic regression:
- Naive Bayes (NB): This works when there is no interaction between predictors.
- Linear Discriminant Analysis (LDA): This works when the interaction between predictors is the same across classes.
- Quadratic Discriminant Analysis (QDA): This works when there are class-based interactions among the predictors
- Random Forrest (RF): - Classification Tree: Although not as trustworthy as random forrest, the potential for public transparency makes this worth trying.
- K-Nearest Neighbors (KNN):

Sormeh Attempts Logistic Regression

## Going to try with all of the factors for now, and then I will be more selective
## working with dataset = recid.all
#glm.fits = glm(two_year_recid ~age_factor+race_factor+gender_factor+crime_factor+age, data=recid.all, family=binomial)
glm.fits = glm(two_year_recid ~age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all, family=binomial)
#glm.fits = glm(two_year_recid ~ race_factor+gender_factor+crime_factor+age, data=recid.all, family=binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = two_year_recid ~ age_factor + race_factor + gender_factor + 
##     crime_factor + priors_count, family = binomial, data = recid.all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7430  -0.9690  -0.6501   1.0858   2.0686  
## 
## Coefficients:
##                              Estimate Std. Error z value   Pr(>|z|)    
## (Intercept)                 -0.608309   0.064512  -9.429    < 2e-16 ***
## age_factorGreater than 45   -0.668217   0.076095  -8.781    < 2e-16 ***
## age_factorLess than 25       0.733479   0.068937  10.640    < 2e-16 ***
## race_factorAfrican-American  0.095986   0.062700   1.531   0.125800    
## race_factorAsian            -0.550754   0.427799  -1.287   0.197950    
## race_factorHispanic         -0.170372   0.108815  -1.566   0.117419    
## race_factorNative American  -0.270741   0.650668  -0.416   0.677339    
## race_factorOther            -0.154898   0.127679  -1.213   0.225061    
## gender_factorFemale         -0.348609   0.071851  -4.852 0.00000122 ***
## crime_factorM               -0.218718   0.058759  -3.722   0.000197 ***
## priors_count                 0.165543   0.008068  20.518    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8506.4  on 6171  degrees of freedom
## Residual deviance: 7576.2  on 6161  degrees of freedom
## AIC: 7598.2
## 
## Number of Fisher Scoring iterations: 4
## To just see the coefficients
summary(glm.fits)$coef[,4]
##                 (Intercept)   age_factorGreater than 45 
##                4.122705e-21                1.614884e-18 
##      age_factorLess than 25 race_factorAfrican-American 
##                1.943911e-26                1.258000e-01 
##            race_factorAsian         race_factorHispanic 
##                1.979497e-01                1.174193e-01 
##  race_factorNative American            race_factorOther 
##                6.773393e-01                2.250612e-01 
##         gender_factorFemale               crime_factorM 
##                1.223076e-06                1.974087e-04 
##                priors_count 
##                1.477025e-93
## Create random training set and testing set
## adding a column of random 0 and 1s 
recid.all$is.test <- sample(c(0,1), size = nrow(recid.all), replace=TRUE)

## Partition recid.all data
recid.all.train <- subset(recid.all, subset = is.test == 0)
recid.all.test <- subset(recid.all, subset = is.test == 1)
## Predict function to predict the probability that someone will have recidivated in 2 years
## type="response" tells R to output probabilities of the form P(Y=1|X)
print("Predicing two_year_recid:")
## [1] "Predicing two_year_recid:"
glm.probs = predict(glm.fits, type="response")
glm.probs[1:10]
##         1         2         3         4         5         6         7         8 
## 0.1928770 0.3746489 0.7075113 0.2725100 0.8467446 0.4337475 0.2358377 0.3524450 
##         9        10 
## 0.6222399 0.2358377
#contrasts(two_year_recid) ## this does not work for some reason...
p <- ggplot(data = recid.all, aes(two_year_recid, fill = as.factor(glm.fits$y))) +
  geom_histogram(binwidth = 1)

ggplotly(p)
## Trying again but just with training data
## Predict function to predict the probability that someone will have recidivated in 2 years
## type="response" tells R to output probabilities of the form P(Y=1|X)
print("Predicing two_year_recid **Training Data**:")
## [1] "Predicing two_year_recid **Training Data**:"
glm.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all.train, family = binomial)
summary(glm.logit)
## 
## Call:
## glm(formula = two_year_recid ~ age_factor + race_factor + gender_factor + 
##     crime_factor + priors_count, family = binomial, data = recid.all.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5637  -0.9931  -0.6260   1.0495   2.1180  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.45019    0.09047  -4.976 6.49e-07 ***
## age_factorGreater than 45   -0.71728    0.10876  -6.595 4.25e-11 ***
## age_factorLess than 25       0.78313    0.09819   7.976 1.52e-15 ***
## race_factorAfrican-American  0.03289    0.08905   0.369    0.712    
## race_factorAsian            -0.67250    0.59312  -1.134    0.257    
## race_factorHispanic         -0.07728    0.15398  -0.502    0.616    
## race_factorNative American  -0.26959    0.79542  -0.339    0.735    
## race_factorOther            -0.39688    0.18175  -2.184    0.029 *  
## gender_factorFemale         -0.52317    0.10231  -5.113 3.17e-07 ***
## crime_factorM               -0.36282    0.08333  -4.354 1.34e-05 ***
## priors_count                 0.15273    0.01096  13.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4248.7  on 3080  degrees of freedom
## Residual deviance: 3749.5  on 3070  degrees of freedom
## AIC: 3771.5
## 
## Number of Fisher Scoring iterations: 4
kable(coef(summary(glm.logit)), digits= 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.450 0.090 -4.976 0.000
age_factorGreater than 45 -0.717 0.109 -6.595 0.000
age_factorLess than 25 0.783 0.098 7.976 0.000
race_factorAfrican-American 0.033 0.089 0.369 0.712
race_factorAsian -0.672 0.593 -1.134 0.257
race_factorHispanic -0.077 0.154 -0.502 0.616
race_factorNative American -0.270 0.795 -0.339 0.735
race_factorOther -0.397 0.182 -2.184 0.029
gender_factorFemale -0.523 0.102 -5.113 0.000
crime_factorM -0.363 0.083 -4.354 0.000
priors_count 0.153 0.011 13.938 0.000
glm.probs = predict(glm.logit, type="response")
glm.probs[1:10]
##         1         2         3         4         5         6         7         8 
## 0.3971614 0.2297209 0.8439626 0.4039866 0.2081385 0.6132963 0.2081385 0.1768393 
##         9        10 
## 0.6574151 0.2910505
#contrasts(two_year_recid) ## this does not work for some reason...
p <- ggplot(data = recid.all.train, aes(two_year_recid, fill = as.factor(glm.logit$y))) +
  geom_histogram(binwidth = 1)

ggplotly(p)
#spam.logit
glm.probs = predict.glm(glm.logit, type="response")
confusion.glm = ifelse(glm.probs > 0.5, "1", "0")
conf.train.pred = table(confusion.glm, recid.all.train$two_year_recid)
conf.train.pred
##              
## confusion.glm    0    1
##             0 1286  594
##             1  386  815
#recid.test.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+age, data=recid.all.test, family=binomial)
recid.test.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all.test, family=binomial)

glm.probs = predict.glm(recid.test.logit, type="response")
conf.test.glm = ifelse(glm.probs > 0.5, "1", "0")
conf.test.pred = table(conf.test.glm, recid.all.test$two_year_recid)
conf.test.pred
##              
## conf.test.glm    0    1
##             0 1328  652
##             1  363  748
# Misclassification
print("Train Misclassification:")
## [1] "Train Misclassification:"
1- sum(diag(conf.train.pred)) / sum(conf.train.pred)
## [1] 0.3180785
print("Test Misclassification:")
## [1] "Test Misclassification:"
1-sum(diag(conf.test.pred)) / sum(conf.test.pred)
## [1] 0.3283727
### COME BACK TO ME!!!
print("Come back here for editing...trying to plot the additive logistic models from homework 4 problem 3")
## [1] "Come back here for editing...trying to plot the additive logistic models from homework 4 problem 3"
#recid.formula <- formula(paste("two_year_recid ~ ", paste("s(", varinfo[,1], ", 4)", sep = "", collapse= " + ")))
# Edit me

#recid.gam <- gam(recid.formula, data=recid.all.train, family = binomial)
#summary(recid.gam)

#par(mfrow=c(15,4))
#plot(recid.gam, se=T, col="ForestGreen")

Sormeh Attempts Naive Bayes Could not find how we did this in lab, so following tutorial at uc-r.github.io/naive_bayes

## Create training (70%) and test (30%) sets for the attrition data
## Use set.seed for reproducibility

set.seed(123)
split <- initial_split(recid.all, prop = .7, strata = "two_year_recid")
train <- training(split)
test <- testing(split)

## Distribution of two_year_recid rates across train and test set
print("Training set distribution:")
## [1] "Training set distribution:"
table(train$two_year_recid) %>% prop.table
## 
##         0         1 
## 0.5448866 0.4551134
print("Testing set distribution:")
## [1] "Testing set distribution:"
table(test$two_year_recid) %>% prop.table
## 
##         0         1 
## 0.5448649 0.4551351

Bayesian probability incorporates the concept of conditional probability, the probability of event A given that event B has occurred [denoted as P(A|B)]. In the context of our data, we are seeking the probability of an individual belonging to two_year_recid = 1, given that its predictor values are [….. something] Here’s a plot that attempts to show correlation among variables. It is different from the way we’ve typically done it in class

## I guess this is a different way of seeing if variables are correlated with one another??
train %>%
  filter(two_year_recid == 1) %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot()
## Warning in stats::cor(x, ...): the standard deviation is zero

Using caret package for Naive Bayes Running into trouble here…

##Ok, I'm not sure why it asks to make features data, but it says to create response and feature data
features <- setdiff(names(train), "two_year_recid")
x <- train[, features]
y <- train$two_year_recid
#x <- as.factor(x)
#y <- as.factor(y)

## set up 10-fold cross validation procedure
train_control <- trainControl(method = "cv", number = 10)

## train model
#nb.m1 <- train(x = x, y = y, method = "nb", trControl = train_control)

## results
#confusionMatrix(nb.m1)

Moses work:

#Choosing variables
recid.subsets <- regsubsets(as.factor(two_year_recid) ~
                              crime_factor +
                              age +
                              age_factor +
                              race_factor +
                              gender_factor +
                              log2(priors_count + 1) +
                              priors_count +
                              juv_fel_count +
                              (juv_fel_count > 0) +
                              juv_misd_count +
                              (juv_misd_count > 0) +
                              juv_other_count,
                            data = recid.all,
                            nbest = 1,
                            nvmax = 15,
                            method = "exhaustive")
plot(x = summary(recid.subsets)$adjr2, xlab = "Number of Variables", ylab = "Adj. R-Squared")
maxr2 <- which.max(summary(recid.subsets)$adjr2)
points(maxr2, summary(recid.subsets)$adjr2[maxr2], col = "red", cex = 1, pch = 20)

LDA work

#LDA
#crime_factor, race_factor, and juv_other_count slightly improve the model & increase specificity (at the expense of sensitivity) - should we include them??
#
#without:                 acc = 67.8, sens = 60.6, spec = 73.9   DOMINATED BY EXCL. RACE, EXCL. CRIME
#all three:               acc = 68.1, sens = 60.4, spec = 74.5   Probably overfit
#exclude race_factor:     acc = 68.1, sens = 60.7, spec = 74.3   Highest sensitivity, accuracy
#exclude juv_other_count: acc = 67.9, sens = 60.0, spec = 74.5   DOMINATED BY INCL. JUV, ALL THREE
#exclude crime_factor:    acc = 68.0, sens = 60.6, spec = 74.2   DOMINATED BY EXCL. RACE
#include race_factor:     acc = 67.9, sens = 60.0, spec = 74.4   DOMINATED BY EXCL. JUV, ALL THREE
#include juv_other_count: acc = 68.0, sens = 60.1, spec = 74.6   Strong, high-specificity
#include crime_factor:    acc = 67.9, sens = 59.8, spec = 74.7   Highest specificity

recid.lda <- lda(two_year_recid ~
                   #crime_factor +
                   age +
                   (age_factor == "Less than 25") +
                   #race_factor +
                   gender_factor +
                   #juv_other_count +
                   log2(priors_count + 1),
                 data = recid.all)
                 #subset = train.all)
#How do I get it to predict the test set?
recid.lda.pred <- predict(recid.lda,
                          type = "response")
lda.tab = table(predicted = recid.lda.pred$class,
                actual = recid.all$two_year_recid)
lda.tab
##          actual
## predicted    0    1
##         0 2484 1106
##         1  879 1703
lda.acc <- (lda.tab[1,1] + lda.tab[2,2]) / sum(lda.tab)
lda.sens <- lda.tab[2,2] / sum(lda.tab[,2])
lda.spec <- lda.tab[1,1] / sum(lda.tab[,1])
tibble(Accuracy = scales::percent(lda.acc, accuracy = 0.1),
       Sensitivity = scales::percent(lda.sens, accuracy = 0.1),
       Specificity = scales::percent(lda.spec, accuracy = 0.1))
## # A tibble: 1 x 3
##   Accuracy Sensitivity Specificity
##   <chr>    <chr>       <chr>      
## 1 67.8%    60.6%       73.9%
recid.qda <- qda(two_year_recid ~
                   #crime_factor +
                   age +
                   (age_factor == "Less than 25") +
                   gender_factor +
                   log2(priors_count + 1),
                 data = recid.all)
recid.qda.pred <- predict(recid.qda,
                          type = "response")
qda.tab = table(predicted = recid.qda.pred$class,
                actual = recid.all$two_year_recid)
qda.tab
##          actual
## predicted    0    1
##         0 2329  982
##         1 1034 1827
qda.acc <- (qda.tab[1,1] + qda.tab[2,2]) / sum(qda.tab)
qda.sens <- qda.tab[2,2] / sum(qda.tab[,2])
qda.spec <- qda.tab[1,1] / sum(qda.tab[,1])
tibble(Accuracy = scales::percent(qda.acc, accuracy = 0.1),
       Sensitivity = scales::percent(qda.sens, accuracy = 0.1),
       Specificity = scales::percent(qda.spec, accuracy = 0.1))
## # A tibble: 1 x 3
##   Accuracy Sensitivity Specificity
##   <chr>    <chr>       <chr>      
## 1 67.3%    65.0%       69.3%

We then decided to pursue [which models should we use?]

#Run these models on training data
#Run cross-validation on these models
#Select the best version of each model based on a low CV
  #Update after meeting: could we actually decide based on high specificity/sensitivity, or is that not legit?
#Output line graphs showing how training and CV error levels work (I think this is a crucial part of the prompt)

#Update after meeting: perhaps first run w/o including race, but then in the discussion section, re-run the final models including race and see if they become less racist?

To decide between models, we looked at three measures of error: -Accuracy: This is the percent of predictions that were correct. It is a limited measure of the tool’s usefulness because it counts false positives and false negatives in the same way, when their human impact is much different.
-Specifity: This shows the true negative rate. A higher number shows that we are not mistakenly classifying people as high-risk. general, we believe this is the most important measure because it keeps innocent people from receiving harsh punishments.
-Sensitivity: This shows the true positive rate. A higher number shows that we successfully predicted people would commit crimes. For violent crime, a high sensitivity rate is important.

Statistics for the best version of each data mining method
[fill this in once we’ve run the models – this is just a template]

Model for any recidivism Accuracy Sensitivity Specificity
Logistic regression
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Naive Bayes (NB)
Random Forrest (RF)
Model for violent recidivism Accuracy Sensitivity Specificity
Logistic regression
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Naive Bayes (NB)
Random Forrest (RF)

[Let’s add some graphs here that show tradoffs between different measures that we care about (like sensitivity vs. specificity. Caulkins would be proud)]

2c: Demographic analysis of the best version of each model:
After choosing the best version of each model, we assessed the potential for prejudice. Our goal was to replicate Pro-Publica’s chart for each of these metrics included in the introduction. [If possible, maybe create some charts comparing models on accuracy vs. prejudice tradeoffs]

2c1: Race

2c2: Gender

2c3: Age

2b: Final model(s)

Our final model is…. This has an accuracy rate of [], specifity of []. We chose it because…

#Add pretty figures about our final model(s)

Section 3: Key Findings and Takeaways

3a: Comparing our tool to the COMPAS tool

3a.1: Does our RAIs perform better or worse than COMPAS?
3a.2: Do our RAIs produce similar classifications to COMPAS?
3a.3: Are there systematic differences between our classifications and those of COMPAS?

3b: Reflections on risk assessment tools

  • We know that arrest data is biased based on where police patrol, so even a model that’s interally sound is still based on biased data.[maybe cite Patrick Ball]
  • Pittsburgh has a pre-trial risk assessment tool
  • Models can be racist without using race as an input…[maybe add more about our research]
  • Black box vs. accessibility

Perhaps cite some articles here, like the two that ProPublica cited, or something more recent Columbia article on Risk as proxy for race (2010) Article on Risk, Race, and Recidivism

References

ProPublica main article
ProPublica methodology
ProPublica github
Raw version of COMPAS survey